fix(llmobs): openai-java payload mapping for responses, tool metadata, and prompt tracking by ygree · Pull Request #10644 · DataDog/dd-trace-java

ygree · 2026-02-19T21:45:14Z

What Does This Do

Aligns OpenAI Java LLMObs span payloads with the expected intake/system-test schema by:

- Adding/filling missing LLMObs tags:
  - _ml_obs_tag.integration
  - _ml_obs_tag.source
  - _ml_obs_tag.ddtrace.version
  - _ml_obs_tag.error
  - _ml_obs_tag.error_type
- Ensuring model_name (and stable placeholder output where applicable) is set on error paths for chat/completions/embeddings/responses.
- Expanding Responses instrumentation:
  - prompt tracking (input.prompt, variables, chat_template)
  - tool definition extraction (tool_definitions)
  - tool call/result extraction across function/custom/MCP outputs
  - metadata normalization (stream, tool_choice, text.verbosity, etc.)
- Updating LLMObs mapper payload shape:
  - writes _dd map with span/trace ids
  - nests error fields under meta.error
  - supports map-based LLM input serialization (messages + prompt)
  - remaps tool_definitions into meta.

Motivation

OpenAI/LLMObs system tests exposed schema and tag mismatches in Java payloads (especially response spans, tool metadata, error mapping, and prompt tracking structure). This change brings the Java output in line with the expected LLMObs intake contract and behavior.

Additional Notes

The openai-java-3.0 min version is updated from 3.0.0 to 3.0.1 because ResponseTextConfig fun verbosity(): Optional<Verbosity> was added in 3.0.1 (openai/openai-java@c1de354#diff-6b385fb153d457757ba112e6117593cb59da6af308cce0f9b6f26e3885befc6cR73).

DataDog/dd-apm-test-agent#280
DataDog/system-tests#6364

Contributor Checklist

Format the title according to the contribution guidelines
Assign the type: and (comp: or inst:) labels in addition to any other useful labels
Avoid using close, fix, or any linking keywords when referencing an issue
Use solves instead, and assign the PR milestone to the issue
Update the CODEOWNERS file on source file addition, migration, or deletion
Update public documentation with any new configuration flags or behaviors

pr-commenter · 2026-02-19T22:33:17Z

Benchmarks

Startup

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	ygree/llmobs-systest-fixes
git_commit_date	1773939812	1774498101
git_commit_sha	`5580c61`	`9911c51`
release_version	1.61.0-SNAPSHOT~5580c61ac4	1.60.0-SNAPSHOT~9911c514e7

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1774499972	1774499972
ci_job_id	1539987088	1539987088
ci_pipeline_id	104461511	104461511
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version	Linux runner-zfyrx7zua-project-304-concurrent-0-c6083ol5 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux	Linux runner-zfyrx7zua-project-304-concurrent-0-c6083ol5 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
module	Agent	Agent
parent	None	None

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 63 metrics, 8 unstable metrics.

Startup time reports for insecure-bank

gantt
    title insecure-bank - global startup overhead: candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.057 s) : 0, 1057214
Total [baseline] (8.82 s) : 0, 8819984
Agent [candidate] (1.055 s) : 0, 1055114
Total [candidate] (8.825 s) : 0, 8825318
section iast
Agent [baseline] (1.235 s) : 0, 1234646
Total [baseline] (9.572 s) : 0, 9572364
Agent [candidate] (1.223 s) : 0, 1222502
Total [candidate] (9.513 s) : 0, 9512836

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.057 s	-
Agent	iast	1.235 s	177.431 ms (16.8%)
Total	tracing	8.82 s	-
Total	iast	9.572 s	752.38 ms (8.5%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.055 s	-
Agent	iast	1.223 s	167.388 ms (15.9%)
Total	tracing	8.825 s	-
Total	iast	9.513 s	687.518 ms (7.8%)

gantt
    title insecure-bank - break down per module: candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4

    dateFormat X
    axisFormat %s
section tracing
crashtracking [baseline] (1.201 ms) : 0, 1201
crashtracking [candidate] (1.186 ms) : 0, 1186
BytebuddyAgent [baseline] (629.845 ms) : 0, 629845
BytebuddyAgent [candidate] (627.923 ms) : 0, 627923
AgentMeter [baseline] (29.345 ms) : 0, 29345
AgentMeter [candidate] (29.337 ms) : 0, 29337
GlobalTracer [baseline] (255.926 ms) : 0, 255926
GlobalTracer [candidate] (256.145 ms) : 0, 256145
AppSec [baseline] (31.594 ms) : 0, 31594
AppSec [candidate] (31.68 ms) : 0, 31680
Debugger [baseline] (59.566 ms) : 0, 59566
Debugger [candidate] (59.394 ms) : 0, 59394
Remote Config [baseline] (582.515 µs) : 0, 583
Remote Config [candidate] (577.755 µs) : 0, 578
Telemetry [baseline] (8.051 ms) : 0, 8051
Telemetry [candidate] (7.984 ms) : 0, 7984
Flare Poller [baseline] (5.031 ms) : 0, 5031
Flare Poller [candidate] (4.941 ms) : 0, 4941
section iast
crashtracking [baseline] (1.199 ms) : 0, 1199
crashtracking [candidate] (1.186 ms) : 0, 1186
BytebuddyAgent [baseline] (801.59 ms) : 0, 801590
BytebuddyAgent [candidate] (793.791 ms) : 0, 793791
AgentMeter [baseline] (11.592 ms) : 0, 11592
AgentMeter [candidate] (11.341 ms) : 0, 11341
GlobalTracer [baseline] (248.392 ms) : 0, 248392
GlobalTracer [candidate] (246.235 ms) : 0, 246235
IAST [baseline] (25.584 ms) : 0, 25584
IAST [candidate] (25.348 ms) : 0, 25348
AppSec [baseline] (26.757 ms) : 0, 26757
AppSec [candidate] (26.361 ms) : 0, 26361
Debugger [baseline] (68.89 ms) : 0, 68890
Debugger [candidate] (68.68 ms) : 0, 68680
Remote Config [baseline] (535.563 µs) : 0, 536
Remote Config [candidate] (525.583 µs) : 0, 526
Telemetry [baseline] (10.237 ms) : 0, 10237
Telemetry [candidate] (9.616 ms) : 0, 9616
Flare Poller [baseline] (3.722 ms) : 0, 3722
Flare Poller [candidate] (3.463 ms) : 0, 3463

Startup time reports for petclinic

gantt
    title petclinic - global startup overhead: candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.062 s) : 0, 1062086
Total [baseline] (10.935 s) : 0, 10934666
Agent [candidate] (1.056 s) : 0, 1056021
Total [candidate] (11.036 s) : 0, 11036221
section appsec
Agent [baseline] (1.244 s) : 0, 1244384
Total [baseline] (11.223 s) : 0, 11222893
Agent [candidate] (1.243 s) : 0, 1243294
Total [candidate] (11.251 s) : 0, 11251193
section iast
Agent [baseline] (1.227 s) : 0, 1227499
Total [baseline] (11.381 s) : 0, 11380525
Agent [candidate] (1.228 s) : 0, 1227712
Total [candidate] (11.237 s) : 0, 11236774
section profiling
Agent [baseline] (1.182 s) : 0, 1182316
Total [baseline] (11.027 s) : 0, 11026737
Agent [candidate] (1.179 s) : 0, 1179376
Total [candidate] (11.051 s) : 0, 11050740

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.062 s	-
Agent	appsec	1.244 s	182.299 ms (17.2%)
Agent	iast	1.227 s	165.413 ms (15.6%)
Agent	profiling	1.182 s	120.23 ms (11.3%)
Total	tracing	10.935 s	-
Total	appsec	11.223 s	288.227 ms (2.6%)
Total	iast	11.381 s	445.86 ms (4.1%)
Total	profiling	11.027 s	92.071 ms (0.8%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.056 s	-
Agent	appsec	1.243 s	187.274 ms (17.7%)
Agent	iast	1.228 s	171.692 ms (16.3%)
Agent	profiling	1.179 s	123.355 ms (11.7%)
Total	tracing	11.036 s	-
Total	appsec	11.251 s	214.972 ms (1.9%)
Total	iast	11.237 s	200.553 ms (1.8%)
Total	profiling	11.051 s	14.518 ms (0.1%)

gantt
    title petclinic - break down per module: candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4

    dateFormat X
    axisFormat %s
section tracing
crashtracking [baseline] (1.199 ms) : 0, 1199
crashtracking [candidate] (1.191 ms) : 0, 1191
BytebuddyAgent [baseline] (631.089 ms) : 0, 631089
BytebuddyAgent [candidate] (629.385 ms) : 0, 629385
AgentMeter [baseline] (29.417 ms) : 0, 29417
AgentMeter [candidate] (29.236 ms) : 0, 29236
GlobalTracer [baseline] (257.28 ms) : 0, 257280
GlobalTracer [candidate] (255.533 ms) : 0, 255533
AppSec [baseline] (31.907 ms) : 0, 31907
AppSec [candidate] (31.686 ms) : 0, 31686
Debugger [baseline] (60.684 ms) : 0, 60684
Debugger [candidate] (60.056 ms) : 0, 60056
Remote Config [baseline] (592.739 µs) : 0, 593
Remote Config [candidate] (584.264 µs) : 0, 584
Telemetry [baseline] (8.084 ms) : 0, 8084
Telemetry [candidate] (8.794 ms) : 0, 8794
Flare Poller [baseline] (5.845 ms) : 0, 5845
Flare Poller [candidate] (3.546 ms) : 0, 3546
section appsec
crashtracking [baseline] (1.197 ms) : 0, 1197
crashtracking [candidate] (1.185 ms) : 0, 1185
BytebuddyAgent [baseline] (655.626 ms) : 0, 655626
BytebuddyAgent [candidate] (656.242 ms) : 0, 656242
AgentMeter [baseline] (12.134 ms) : 0, 12134
AgentMeter [candidate] (12.08 ms) : 0, 12080
GlobalTracer [baseline] (258.025 ms) : 0, 258025
GlobalTracer [candidate] (257.728 ms) : 0, 257728
IAST [baseline] (24.346 ms) : 0, 24346
IAST [candidate] (24.138 ms) : 0, 24138
AppSec [baseline] (178.074 ms) : 0, 178074
AppSec [candidate] (177.14 ms) : 0, 177140
Debugger [baseline] (66.156 ms) : 0, 66156
Debugger [candidate] (66.06 ms) : 0, 66060
Remote Config [baseline] (633.937 µs) : 0, 634
Remote Config [candidate] (628.381 µs) : 0, 628
Telemetry [baseline] (8.38 ms) : 0, 8380
Telemetry [candidate] (8.335 ms) : 0, 8335
Flare Poller [baseline] (3.605 ms) : 0, 3605
Flare Poller [candidate] (3.64 ms) : 0, 3640
section iast
crashtracking [baseline] (1.192 ms) : 0, 1192
crashtracking [candidate] (1.227 ms) : 0, 1227
BytebuddyAgent [baseline] (796.766 ms) : 0, 796766
BytebuddyAgent [candidate] (797.451 ms) : 0, 797451
AgentMeter [baseline] (11.444 ms) : 0, 11444
AgentMeter [candidate] (11.399 ms) : 0, 11399
GlobalTracer [baseline] (246.634 ms) : 0, 246634
GlobalTracer [candidate] (246.92 ms) : 0, 246920
IAST [baseline] (25.312 ms) : 0, 25312
IAST [candidate] (25.298 ms) : 0, 25298
AppSec [baseline] (26.417 ms) : 0, 26417
AppSec [candidate] (26.383 ms) : 0, 26383
Debugger [baseline] (69.876 ms) : 0, 69876
Debugger [candidate] (70.147 ms) : 0, 70147
Remote Config [baseline] (526.764 µs) : 0, 527
Remote Config [candidate] (517.289 µs) : 0, 517
Telemetry [baseline] (9.725 ms) : 0, 9725
Telemetry [candidate] (9.1 ms) : 0, 9100
Flare Poller [baseline] (3.528 ms) : 0, 3528
Flare Poller [candidate] (3.295 ms) : 0, 3295
section profiling
crashtracking [baseline] (1.165 ms) : 0, 1165
crashtracking [candidate] (1.17 ms) : 0, 1170
BytebuddyAgent [baseline] (682.827 ms) : 0, 682827
BytebuddyAgent [candidate] (680.691 ms) : 0, 680691
AgentMeter [baseline] (9.009 ms) : 0, 9009
AgentMeter [candidate] (9.022 ms) : 0, 9022
GlobalTracer [baseline] (215.426 ms) : 0, 215426
GlobalTracer [candidate] (214.949 ms) : 0, 214949
AppSec [baseline] (32.057 ms) : 0, 32057
AppSec [candidate] (32.013 ms) : 0, 32013
Debugger [baseline] (64.922 ms) : 0, 64922
Debugger [candidate] (65.816 ms) : 0, 65816
Remote Config [baseline] (562.091 µs) : 0, 562
Remote Config [candidate] (556.071 µs) : 0, 556
Telemetry [baseline] (8.52 ms) : 0, 8520
Telemetry [candidate] (7.645 ms) : 0, 7645
Flare Poller [baseline] (3.434 ms) : 0, 3434
Flare Poller [candidate] (3.452 ms) : 0, 3452
ProfilingAgent [baseline] (93.505 ms) : 0, 93505
ProfilingAgent [candidate] (93.457 ms) : 0, 93457
Profiling [baseline] (94.069 ms) : 0, 94069
Profiling [candidate] (94.019 ms) : 0, 94019

Load

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	ygree/llmobs-systest-fixes
git_commit_date	1773939812	1774498101
git_commit_sha	`5580c61`	`9911c51`
release_version	1.61.0-SNAPSHOT~5580c61ac4	1.60.0-SNAPSHOT~9911c514e7

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1774500432	1774500432
ci_job_id	1539987089	1539987089
ci_pipeline_id	104461511	104461511
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version	Linux runner-zfyrx7zua-project-304-concurrent-0-yl9bkdb2 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux	Linux runner-zfyrx7zua-project-304-concurrent-0-yl9bkdb2 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Summary

Found 1 performance improvements and 3 performance regressions! Performance is the same for 17 metrics, 15 unstable metrics.

scenario	Δ mean agg_http_req_duration_p50	Δ mean agg_http_req_duration_p95	Δ mean throughput	candidate mean agg_http_req_duration_p50	candidate mean agg_http_req_duration_p95	candidate mean throughput	baseline mean agg_http_req_duration_p50	baseline mean agg_http_req_duration_p95	baseline mean throughput
scenario:load:insecure-bank:iast:high_load	better [-172.877µs; -98.795µs] or [-6.774%; -3.871%]	unsure [-432.860µs; -65.137µs] or [-5.866%; -0.883%]	unstable [-102.590op/s; +212.653op/s] or [-7.315%; +15.162%]	2.416ms	7.131ms	1457.531op/s	2.552ms	7.380ms	1402.500op/s
scenario:load:petclinic:profiling:high_load	worse [+433.580µs; +1090.144µs] or [+2.346%; +5.899%]	same [-378.632µs; +1043.903µs] or [-1.253%; +3.455%]	unstable [-30.953op/s; +14.141op/s] or [-12.503%; +5.712%]	19.241ms	30.549ms	239.156op/s	18.480ms	30.216ms	247.562op/s
scenario:load:petclinic:no_agent:high_load	worse [+1.111ms; +2.215ms] or [+6.733%; +13.417%]	worse [+1.380ms; +3.613ms] or [+4.934%; +12.918%]	unstable [-49.267op/s; +1.642op/s] or [-18.030%; +0.601%]	18.172ms	30.462ms	249.438op/s	16.508ms	27.965ms	273.250op/s

Request duration reports for petclinic

gantt
    title petclinic - request duration [CI 0.99] : candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4
    dateFormat X
    axisFormat %s
section baseline
no_agent (17.07 ms) : 16902, 17238
.   : milestone, 17070,
appsec (18.737 ms) : 18549, 18925
.   : milestone, 18737,
code_origins (18.867 ms) : 18679, 19055
.   : milestone, 18867,
iast (17.584 ms) : 17409, 17759
.   : milestone, 17584,
profiling (18.853 ms) : 18662, 19043
.   : milestone, 18853,
tracing (17.549 ms) : 17376, 17722
.   : milestone, 17549,
section candidate
no_agent (18.71 ms) : 18521, 18899
.   : milestone, 18710,
appsec (18.564 ms) : 18376, 18752
.   : milestone, 18564,
code_origins (18.94 ms) : 18752, 19128
.   : milestone, 18940,
iast (17.566 ms) : 17389, 17742
.   : milestone, 17566,
profiling (19.516 ms) : 19323, 19710
.   : milestone, 19516,
tracing (17.828 ms) : 17652, 18004
.   : milestone, 17828,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	17.07 ms [16.902 ms, 17.238 ms]	-
appsec	18.737 ms [18.549 ms, 18.925 ms]	1.667 ms (9.8%)
code_origins	18.867 ms [18.679 ms, 19.055 ms]	1.797 ms (10.5%)
iast	17.584 ms [17.409 ms, 17.759 ms]	513.509 µs (3.0%)
profiling	18.853 ms [18.662 ms, 19.043 ms]	1.782 ms (10.4%)
tracing	17.549 ms [17.376 ms, 17.722 ms]	478.165 µs (2.8%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	18.71 ms [18.521 ms, 18.899 ms]	-
appsec	18.564 ms [18.376 ms, 18.752 ms]	-146.064 µs (-0.8%)
code_origins	18.94 ms [18.752 ms, 19.128 ms]	230.27 µs (1.2%)
iast	17.566 ms [17.389 ms, 17.742 ms]	-1.144 ms (-6.1%)
profiling	19.516 ms [19.323 ms, 19.71 ms]	806.411 µs (4.3%)
tracing	17.828 ms [17.652 ms, 18.004 ms]	-881.815 µs (-4.7%)

Request duration reports for insecure-bank

gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.197 ms) : 1185, 1209
.   : milestone, 1197,
iast (3.263 ms) : 3217, 3308
.   : milestone, 3263,
iast_FULL (5.858 ms) : 5800, 5917
.   : milestone, 5858,
iast_GLOBAL (3.519 ms) : 3467, 3570
.   : milestone, 3519,
profiling (2.161 ms) : 2141, 2181
.   : milestone, 2161,
tracing (1.786 ms) : 1771, 1802
.   : milestone, 1786,
section candidate
no_agent (1.182 ms) : 1170, 1194
.   : milestone, 1182,
iast (3.138 ms) : 3093, 3182
.   : milestone, 3138,
iast_FULL (6.043 ms) : 5981, 6104
.   : milestone, 6043,
iast_GLOBAL (3.605 ms) : 3548, 3662
.   : milestone, 3605,
profiling (2.127 ms) : 2108, 2146
.   : milestone, 2127,
tracing (1.782 ms) : 1767, 1796
.   : milestone, 1782,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.197 ms [1.185 ms, 1.209 ms]	-
iast	3.263 ms [3.217 ms, 3.308 ms]	2.066 ms (172.6%)
iast_FULL	5.858 ms [5.8 ms, 5.917 ms]	4.661 ms (389.5%)
iast_GLOBAL	3.519 ms [3.467 ms, 3.57 ms]	2.322 ms (194.0%)
profiling	2.161 ms [2.141 ms, 2.181 ms]	964.394 µs (80.6%)
tracing	1.786 ms [1.771 ms, 1.802 ms]	589.556 µs (49.3%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.182 ms [1.17 ms, 1.194 ms]	-
iast	3.138 ms [3.093 ms, 3.182 ms]	1.956 ms (165.6%)
iast_FULL	6.043 ms [5.981 ms, 6.104 ms]	4.861 ms (411.4%)
iast_GLOBAL	3.605 ms [3.548 ms, 3.662 ms]	2.423 ms (205.1%)
profiling	2.127 ms [2.108 ms, 2.146 ms]	945.62 µs (80.0%)
tracing	1.782 ms [1.767 ms, 1.796 ms]	600.258 µs (50.8%)

Dacapo

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	ygree/llmobs-systest-fixes
git_commit_date	1773939812	1774498101
git_commit_sha	`5580c61`	`9911c51`
release_version	1.61.0-SNAPSHOT~5580c61ac4	1.60.0-SNAPSHOT~9911c514e7

See matching parameters

	Baseline	Candidate
application	biojava	biojava
ci_job_date	1774500181	1774500181
ci_job_id	1539987090	1539987090
ci_pipeline_id	104461511	104461511
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
kernel_version	Linux runner-zfyrx7zua-project-304-concurrent-0-eylivf88 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux	Linux runner-zfyrx7zua-project-304-concurrent-0-eylivf88 6.8.0-1031-aws #33~22.04.1-Ubuntu SMP Thu Jun 26 14:22:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics.

Execution time for biojava

gantt
    title biojava - execution time [CI 0.99] : candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4
    dateFormat X
    axisFormat %s
section baseline
no_agent (15.596 s) : 15596000, 15596000
.   : milestone, 15596000,
appsec (14.737 s) : 14737000, 14737000
.   : milestone, 14737000,
iast (18.122 s) : 18122000, 18122000
.   : milestone, 18122000,
iast_GLOBAL (18.018 s) : 18018000, 18018000
.   : milestone, 18018000,
profiling (15.39 s) : 15390000, 15390000
.   : milestone, 15390000,
tracing (15.033 s) : 15033000, 15033000
.   : milestone, 15033000,
section candidate
no_agent (15.684 s) : 15684000, 15684000
.   : milestone, 15684000,
appsec (14.887 s) : 14887000, 14887000
.   : milestone, 14887000,
iast (18.409 s) : 18409000, 18409000
.   : milestone, 18409000,
iast_GLOBAL (17.712 s) : 17712000, 17712000
.   : milestone, 17712000,
profiling (14.915 s) : 14915000, 14915000
.   : milestone, 14915000,
tracing (14.834 s) : 14834000, 14834000
.   : milestone, 14834000,

baseline results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	15.596 s [15.596 s, 15.596 s]	-
appsec	14.737 s [14.737 s, 14.737 s]	-859.0 ms (-5.5%)
iast	18.122 s [18.122 s, 18.122 s]	2.526 s (16.2%)
iast_GLOBAL	18.018 s [18.018 s, 18.018 s]	2.422 s (15.5%)
profiling	15.39 s [15.39 s, 15.39 s]	-206.0 ms (-1.3%)
tracing	15.033 s [15.033 s, 15.033 s]	-563.0 ms (-3.6%)

candidate results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	15.684 s [15.684 s, 15.684 s]	-
appsec	14.887 s [14.887 s, 14.887 s]	-797.0 ms (-5.1%)
iast	18.409 s [18.409 s, 18.409 s]	2.725 s (17.4%)
iast_GLOBAL	17.712 s [17.712 s, 17.712 s]	2.028 s (12.9%)
profiling	14.915 s [14.915 s, 14.915 s]	-769.0 ms (-4.9%)
tracing	14.834 s [14.834 s, 14.834 s]	-850.0 ms (-5.4%)

Execution time for tomcat

gantt
    title tomcat - execution time [CI 0.99] : candidate=1.60.0-SNAPSHOT~9911c514e7, baseline=1.61.0-SNAPSHOT~5580c61ac4
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.474 ms) : 1463, 1486
.   : milestone, 1474,
appsec (2.522 ms) : 2467, 2576
.   : milestone, 2522,
iast (2.259 ms) : 2190, 2328
.   : milestone, 2259,
iast_GLOBAL (2.298 ms) : 2229, 2367
.   : milestone, 2298,
profiling (2.081 ms) : 2026, 2135
.   : milestone, 2081,
tracing (2.064 ms) : 2010, 2117
.   : milestone, 2064,
section candidate
no_agent (1.479 ms) : 1467, 1490
.   : milestone, 1479,
appsec (2.565 ms) : 2508, 2622
.   : milestone, 2565,
iast (2.253 ms) : 2184, 2322
.   : milestone, 2253,
iast_GLOBAL (2.3 ms) : 2231, 2369
.   : milestone, 2300,
profiling (2.077 ms) : 2023, 2132
.   : milestone, 2077,
tracing (2.067 ms) : 2014, 2120
.   : milestone, 2067,

baseline results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	1.474 ms [1.463 ms, 1.486 ms]	-
appsec	2.522 ms [2.467 ms, 2.576 ms]	1.047 ms (71.1%)
iast	2.259 ms [2.19 ms, 2.328 ms]	784.908 µs (53.2%)
iast_GLOBAL	2.298 ms [2.229 ms, 2.367 ms]	823.816 µs (55.9%)
profiling	2.081 ms [2.026 ms, 2.135 ms]	606.4 µs (41.1%)
tracing	2.064 ms [2.01 ms, 2.117 ms]	589.367 µs (40.0%)

candidate results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	1.479 ms [1.467 ms, 1.49 ms]	-
appsec	2.565 ms [2.508 ms, 2.622 ms]	1.086 ms (73.5%)
iast	2.253 ms [2.184 ms, 2.322 ms]	774.357 µs (52.4%)
iast_GLOBAL	2.3 ms [2.231 ms, 2.369 ms]	821.441 µs (55.5%)
profiling	2.077 ms [2.023 ms, 2.132 ms]	598.211 µs (40.5%)
tracing	2.067 ms [2.014 ms, 2.12 ms]	588.252 µs (39.8%)

… with variables + chat_template, longest-first overlap handling) and support map-based LLM input serialization (messages + prompt) in LLMObs mapper. Also filter empty instruction messages to match system-test expectations.

ygree · 2026-03-17T22:11:55Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c879ba692

...enai-java-3.0/src/main/java/datadog/trace/instrumentation/openai_java/ResponseDecorator.java

Kyle-Verhoog

LLMObs Team Review

Nice work aligning the Java SDK payloads with the intake schema — this is a big step for system test compliance. A few items to address/clarify below (inline), plus some overall notes:

Test Coverage Notes

What's well-covered: LLMObsSpanMapperTest expansion is great — covers _dd map, nested meta.error, map-based input with prompt/chat_template, tool definitions, tool calls + tool results. The decorator tests verify the new tags (source, integration, error, ddtrace.version).

Gaps to consider:

Error paths: No test exercises the error-path defaults (model_name and empty output set during withResponseCreateParams when the HTTP call fails). A test in which the response errors out, verifying that the span still has model_name and placeholder output, would be valuable.
Prompt tracking: enrichInputWithPromptTracking(), extractChatTemplate(), extractPromptFromParams(), and normalizePromptVariable() have no unit tests. Template variable replacement edge cases (overlapping values, empty variables, image/file fallbacks) would increase confidence.
Custom/MCP tool calls: ToolCallExtractor.getToolCall(ResponseCustomToolCall) and getToolCall(McpCall) are new with no unit tests.
JsonValueUtils: New utility class with no dedicated tests for recursive JSON-to-Object conversion.
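
The overlapping-values edge case named in the prompt tracking gap can be illustrated with a minimal sketch. It assumes the chat template is reconstructed by replacing each variable's value in the rendered text with a `{{name}}` placeholder, longest value first, so that a short value contained in a longer one does not clobber it. The class and method names (`ChatTemplateSketch`, `toTemplate`) and the placeholder syntax are hypothetical and are not taken from the PR; the actual extractChatTemplate() may work differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the PR's implementation.
final class ChatTemplateSketch {
  private ChatTemplateSketch() {}

  /** Replaces each non-empty variable value in text with {{name}}, longest value first. */
  static String toTemplate(String text, Map<String, String> variables) {
    List<Map.Entry<String, String>> usable = new ArrayList<>();
    for (Map.Entry<String, String> e : variables.entrySet()) {
      // Skip null/empty values: replacing "" would insert a placeholder between every character.
      if (e.getValue() != null && !e.getValue().isEmpty()) {
        usable.add(e);
      }
    }
    // Longest values first, so "New York City" is handled before "New York".
    usable.sort(
        Comparator.comparingInt((Map.Entry<String, String> e) -> e.getValue().length())
            .reversed());
    String result = text;
    for (Map.Entry<String, String> e : usable) {
      result = result.replace(e.getValue(), "{{" + e.getKey() + "}}");
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> vars = new LinkedHashMap<>();
    vars.put("city", "New York");
    vars.put("full", "New York City");
    vars.put("empty", "");
    System.out.println(toTemplate("Weather in New York City, not just New York", vars));
  }
}
```

A unit test along these lines would cover the overlapping and empty-variable cases listed above; image/file fallbacks would need separate cases.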

Questions

The min version bump from 3.0.0 to 3.0.1 — what API was missing in 3.0.0? This affects which customer versions get instrumented.
For the _dd map — does the intake expect apm_trace_id to equal trace_id? In other SDKs these can differ (APM trace ID vs LLMObs ID).

Kyle-Verhoog · 2026-03-22T04:34:50Z

.github/workflows/run-system-tests.yaml

    # If you change the following comment, update the pattern in the update_system_test_reference.sh script to match.
-    uses: DataDog/system-tests/.github/workflows/system-tests.yml@main # system tests are pinned for releases only
+    uses: DataDog/system-tests/.github/workflows/system-tests.yml@ea458202a7673efbe365e498d64d74a815c0a137 # system tests are pinned for releases only
    secrets:


System tests are pinned to commit ea458202 instead of main. The comment says "system tests are pinned for releases only." This should be reverted to main before merge to avoid blocking future system test updates. If it's intentional for CI during development, just make sure it goes back before merging.

Yes I also think this is a bad manipulation (hence my review is only limited to this file :) )

Yes, it was intentional and has just been reverted. The reason for this was to verify that the system tests pass with changes that are not yet part of the main branch: DataDog/system-tests#6364 Without that, none of the related tests would run at all.

Kyle-Verhoog · 2026-03-22T04:34:50Z

dd-trace-core/src/main/java/datadog/trace/llmobs/writer/ddintake/LLMObsSpanMapper.java

-      writable.writeString(errored ? "error" : "ok", null);
+      writable.writeUTF8(DD);
+      writable.startMap(3);
+      writable.writeUTF8(SPAN_ID);


The _dd map writes span_id, trace_id, and apm_trace_id. The Python SDK also includes t_id (64-bit trace ID) and s_id in the _dd map. Can you verify this is the correct subset for the Java path? If the intake expects additional fields, spans may be rejected or processed incorrectly.

This is aligned with dd-trace-py https://github.com/DataDog/dd-trace-py/blob/876c5f1ce4d173815537798a6a7b0ac15b0a4ede/ddtrace/llmobs/_llmobs.py#L618-L622. I don't find any t_id or s_id there.
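
For reference, the three-entry shape discussed in this thread can be sketched with plain maps. This is only an illustration: the real mapper writes msgpack through `writable.startMap(3)`, and representing the ids as strings, the `DdMapSketch` class, and the span name in `main` are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the _dd map shape; not the PR's mapper code.
final class DdMapSketch {
  private DdMapSketch() {}

  /** Builds the three-entry _dd map: span_id, trace_id, apm_trace_id. */
  static Map<String, String> ddMap(String spanId, String traceId, String apmTraceId) {
    Map<String, String> dd = new LinkedHashMap<>();
    dd.put("span_id", spanId);
    dd.put("trace_id", traceId);
    dd.put("apm_trace_id", apmTraceId);
    return dd;
  }

  /** Returns a copy of the span event with the _dd map attached. */
  static Map<String, Object> withDd(Map<String, Object> event, Map<String, String> dd) {
    Map<String, Object> copy = new LinkedHashMap<>(event);
    copy.put("_dd", dd);
    return copy;
  }

  public static void main(String[] args) {
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("name", "example.span"); // hypothetical span name
    // Id values are arbitrary placeholders.
    System.out.println(withDd(event, ddMap("span-1", "trace-1", "apm-trace-1")));
  }
}
```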

Kyle-Verhoog · 2026-03-22T04:34:50Z

dd-trace-core/src/main/java/datadog/trace/llmobs/writer/ddintake/LLMObsSpanMapper.java

-
-      boolean errored = span.getError() == 1;
+      writable.writeUTF8(STATUS);
+      writable.writeString(span.getError() == 0 ? "ok" : "error", null);


The top-level error: 0/1 integer field has been removed and replaced with status: "ok"/"error" + error details nested under meta.error. Can you confirm no downstream consumers (EvP remapper, indexer facets, etc.) read error from the top level? This is a payload shape change that could be breaking if anything depends on the old field.

This change is dictated by the TestOpenAiLlmInteractions::test_chat_completion assertion. I assume that the system test assertions are correct. Have they been verified as being compliant with the requirements of downstream consumers?
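
The payload shape change under discussion can be sketched as follows, using plain maps instead of the msgpack writer. The derivation of status from the error flag follows the diff above; the key names inside meta.error (`type`, `message`) and the `StatusSketch` class are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; the real mapper serializes with msgpack.
final class StatusSketch {
  private StatusSketch() {}

  /** Builds an event with a string status and error details nested under meta.error. */
  static Map<String, Object> spanEvent(int errorFlag, String errorType, String errorMessage) {
    Map<String, Object> meta = new LinkedHashMap<>();
    if (errorFlag != 0) {
      Map<String, Object> error = new LinkedHashMap<>();
      error.put("type", errorType); // assumed key name
      error.put("message", errorMessage); // assumed key name
      meta.put("error", error);
    }
    Map<String, Object> event = new LinkedHashMap<>();
    // No top-level integer "error" field; only the string status.
    event.put("status", errorFlag == 0 ? "ok" : "error");
    event.put("meta", meta);
    return event;
  }

  public static void main(String[] args) {
    System.out.println(spanEvent(0, null, null));
    System.out.println(spanEvent(1, "java.io.IOException", "connection reset"));
  }
}
```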

Kyle-Verhoog · 2026-03-22T04:34:50Z

...enai-java-3.0/src/main/java/datadog/trace/instrumentation/openai_java/ResponseDecorator.java

+        }
+      }
+    } catch (Throwable ignored) {
+      // fall back to raw JSON if typed extraction is unavailable or fails


nit: catch (Throwable) swallows OutOfMemoryError, StackOverflowError, etc. Consider narrowing to catch (Exception) for both fallback paths here.
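
A minimal sketch of the narrower catch suggested here. The `FallbackSketch` class and its `typedOrFallback` helper are hypothetical stand-ins for the typed extraction and raw-JSON fallback in ResponseDecorator, not code from the PR.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the typed-extraction-with-fallback pattern.
final class FallbackSketch {
  private FallbackSketch() {}

  /** Tries the typed path; falls back on Exception but lets Errors propagate. */
  static <T> T typedOrFallback(Supplier<T> typed, Supplier<T> rawFallback) {
    try {
      return typed.get();
    } catch (Exception e) {
      // Covers RuntimeException and other Exceptions only.
      // OutOfMemoryError, StackOverflowError, etc. are Errors and are not caught here.
      return rawFallback.get();
    }
  }

  public static void main(String[] args) {
    String value =
        FallbackSketch.<String>typedOrFallback(
            () -> {
              throw new IllegalStateException("typed extraction unavailable");
            },
            () -> "{\"raw\":true}");
    System.out.println(value);
  }
}
```

With `catch (Exception)`, a JVM-level failure during extraction surfaces to the caller instead of being silently replaced by the fallback value.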

Kyle-Verhoog · 2026-03-22T04:34:50Z

dd-trace-core/src/main/java/datadog/trace/llmobs/writer/ddintake/LLMObsSpanMapper.java

+        boolean hasToolCalls = null != toolCalls && !toolCalls.isEmpty();
+        boolean hasToolResults = null != toolResults && !toolResults.isEmpty();
+        boolean hasContent = message.getContent() != null;
+        int mapSize = 1;


Behavioral change: previously content was always written (even as null). Now it's skipped when message.getContent() == null (e.g., tool-call-only messages). This is likely correct and matches Python SDK behavior, but worth confirming the intake handles messages without a content key.

This change is driven by TestOpenAiResponses::test_responses_create_tool_call. When the content is null, it is expected to be missing; otherwise, the assertion fails.
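
The behavior described in this thread can be sketched with plain maps. The size is counted up front because a msgpack map header must declare its entry count before the entries are written. The `MessageMapSketch` class, the key names, and the assumption that `role` is the one key always present (matching `int mapSize = 1` in the diff) are illustrative guesses, not the PR's code.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of omitting absent keys from a message map.
final class MessageMapSketch {
  private MessageMapSketch() {}

  /** Counts the keys that will be written; role is assumed to be always present. */
  static int mapSize(String content, List<?> toolCalls, List<?> toolResults) {
    int mapSize = 1; // role
    if (content != null) mapSize++;
    if (toolCalls != null && !toolCalls.isEmpty()) mapSize++;
    if (toolResults != null && !toolResults.isEmpty()) mapSize++;
    return mapSize;
  }

  /** Builds the message map, skipping content when null and tool lists when null or empty. */
  static Map<String, Object> toMap(
      String role, String content, List<?> toolCalls, List<?> toolResults) {
    Map<String, Object> out = new LinkedHashMap<>();
    out.put("role", role);
    if (content != null) out.put("content", content);
    if (toolCalls != null && !toolCalls.isEmpty()) out.put("tool_calls", toolCalls);
    if (toolResults != null && !toolResults.isEmpty()) out.put("tool_results", toolResults);
    return out;
  }

  public static void main(String[] args) {
    List<String> calls = Collections.singletonList("get_weather"); // placeholder tool call
    System.out.println(toMap("assistant", null, calls, null));
  }
}
```

Note that an empty string still counts as content here, which is consistent with the commits that use "" instead of null for role and content.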

ygree · 2026-03-24T19:53:07Z

dd-java-agent/instrumentation/openai-java/openai-java-3.0/build.gradle

 apply from: "$rootDir/gradle/java.gradle"

-def minVer = '3.0.0'
+def minVer = '3.0.1'


ResponseTextConfig fun verbosity(): Optional<Verbosity> was added in 3.0.1 openai/openai-java@c1de354#diff-6b385fb153d457757ba112e6117593cb59da6af308cce0f9b6f26e3885befc6cR73

ygree · 2026-03-24T22:25:05Z

Questions

The min version bump from 3.0.0 to 3.0.1 — what API was missing in 3.0.0? This affects which customer versions get instrumented.

ResponseTextConfig fun verbosity(): Optional<Verbosity> was added in 3.0.1 openai/openai-java@c1de354#diff-6b385fb153d457757ba112e6117593cb59da6af308cce0f9b6f26e3885befc6cR73

For the _dd map — does the intake expect apm_trace_id to equal trace_id? In other SDKs these can differ (APM trace ID vs LLMObs ID).

This is aligned with dd-trace-py https://github.com/DataDog/dd-trace-py/blob/876c5f1ce4d173815537798a6a7b0ac15b0a4ede/ddtrace/llmobs/_llmobs.py#L618-L622.

…f enrichInputWithPromptTracking(), extractChatTemplate(), extractPromptFromParams(), and normalizePromptVariable()

…of getToolCall

… format. Test cover extractPromptFromParams and related methods

ygree self-assigned this Feb 19, 2026

ygree added labels "comp: mlobs" (ML Observability, LLMObs) and "type: bug" (Bug report and fix) Feb 19, 2026

llmobs: set model tag even when llmobs disabled

cbd6226

ygree force-pushed the ygree/llmobs-systest-fixes branch from 5cd257e to cbd6226 Compare February 24, 2026 09:31

ygree changed the title ~~llmobs: set model tag even when llmobs disabled~~ fix(llmobs): set model tag even when llmobs disabled Mar 2, 2026

ygree added 23 commits March 2, 2026 13:30

Set metadata.stream tag no matter it's true or false

4f27673

Set chat/completion CACHE_READ_INPUT_TOKENS tag

d128d6b

Set error and error_type tags

3fc5ceb

Use "" instead of null for the role in CompletionDecorator to comply …

021a9d1

…wthTestOpenAiLlmInteractions::test_completion

Use "" instead of null for the content to comply with TestOpenAiLlmIn…

0637931

…teractions::test_chat_completion_tool_call

Add missing metadata.tool_choice

0cb41e1

Add missing tool_definitions

a42f8aa

Add source:integration tag

6e10255

Add missing _dd attribute to the llmobs span event

34f3a07

Add missing error tags

a0c1139

Remove error from the llmobs span event. It must be part of meta block

effc343

Add missing meta.text.verbosity

c0e3876

Add summaryText and encrypted_content

b000770

Add missing tool_calls and tool_results for responses

53471a2

Always set stream param to produce the same request body to be aligne…

2207c46

…d with python openai instrumentation and system-tests

Fix OpenAI Responses prompt tracking to use response instructions fir…

7d683b6

…st and return [image] (not empty) when stripped input_image URLs are missing, aligning mixed-input chat_template output with expected behavior.

Set LLMObs error-path defaults in Java to always emit model_name and …

2c17ddc

…output.messages from request params so existing error-span tests pass.

Add OpenAI Responses tool definition extraction to populate LLMObs tool_definitions tags

ad3b782

Fix ChatCompletionServiceTest

1810327

Extract JsonValueUtils

46221e4

Refactor OpenAI responses instrumentation to reuse ToolCallExtractor JSON argument parsing and remove duplicate manual parsing logic from ResponseDecorator.

61ad667

Fix test assertions

f0957b7

ygree added the tag: ai generated (Largely based on code generated by an AI or LLM) and tag: no release notes (Changes to exclude from release notes) labels Mar 6, 2026

ygree marked this pull request as ready for review March 6, 2026 13:46

ygree requested review from a team as code owners March 6, 2026 13:46

chatgpt-codex-connector bot reviewed Mar 17, 2026

...enai-java-3.0/src/main/java/datadog/trace/instrumentation/openai_java/ResponseDecorator.java (outdated, resolved)

...enai-java-3.0/src/main/java/datadog/trace/instrumentation/openai_java/ResponseDecorator.java (outdated, resolved)

ygree added 4 commits March 17, 2026 17:35

Include input messages when instructions are present in prompt tracking

f4e3a8b

Fix instructions role to system in prompt tracking

028d64f

Merge branch 'master' into ygree/llmobs-systest-fixes

82f4303

fix LLMObsSpanMapperTest

717a8f0

ygree requested a review from a team as a code owner March 20, 2026 00:37

ygree requested review from amarziali and removed request for a team March 20, 2026 00:37

ygree removed the tag: no release notes (Changes to exclude from release notes) label Mar 20, 2026

Kyle-Verhoog reviewed Mar 22, 2026

ygree force-pushed the ygree/llmobs-systest-fixes branch from 6dcdaf4 to 717a8f0 Compare March 24, 2026 18:43

ygree commented Mar 24, 2026

Catch exception not throwable

8420f0a

ygree added 7 commits March 24, 2026 16:00

Add JsonValueUtilsTest

91707fa

Test that on HTTP error, the OpenAI response span retains model_name and placeholder output set by withResponseCreateParams.

3d12515

Add "create response with prompt tracking" test to improve coverage of enrichInputWithPromptTracking(), extractChatTemplate(), extractPromptFromParams(), and normalizePromptVariable()

576cec7

Add "create response with custom tool call" test to improve coverage of getToolCall

ba0cb27

Prevent NPE when tag value is null

8be92d7

Replace catch Throwable with catch Exception

1036ed4

responseCreateParamsWithPromptTracking supports both known and unknown format. Tests cover extractPromptFromParams and related methods

9911c51

ygree requested a review from a team as a code owner March 26, 2026 04:08

ygree added this to the 1.61.0 milestone Mar 26, 2026

ygree requested a review from Kyle-Verhoog March 26, 2026 18:13

Conversation

ygree commented Feb 19, 2026 • edited

pr-commenter bot commented Feb 19, 2026 • edited

Benchmarks: Startup, Load, Dacapo

ygree commented Mar 17, 2026

chatgpt-codex-connector bot left a comment

💡 Codex Review

Kyle-Verhoog left a comment

LLMObs Team Review

Test Coverage Notes

Questions

ygree commented Mar 24, 2026 • edited

Questions

3 participants
